{ "cells": [ { "cell_type": "markdown", "id": "c51e8572", "metadata": {}, "source": [ "# Homework 3\n", "\n", "In this homework you will:\n", "* Implement the Naive Bayes algorithm and use it for classification\n", "* Use feature weights learned with logistic regression to assess feature importance" ] }, { "cell_type": "markdown", "id": "3fe0cca2", "metadata": {}, "source": [ "## Naive Bayes\n", "\n", "Recall that the Naive Bayes classifier computes $p(\\textrm{class} = y|x)$ for an instance where $x = (x_1 = v_1, x_2 = v_2, \\ldots, x_n = v_n)$ is a vector of feature values. It does that for every possible value of the class label and chooses the label that yields the largest probability. Concretely, that probability is computed as follows. Note the use of the \"proportional to\" symbol $\\propto$ below: we are ignoring the $p(x)$ term that arises from Bayes' rule, because it is the same for every class label.\n", "\n", "$p(\\textrm{class} = y|x) \\propto p(x_1 = v_1 | \\textrm{class} = y) \\cdot p(x_2 = v_2 | \\textrm{class} = y) \\cdots p(x_n = v_n | \\textrm{class} = y) \\cdot p(\\textrm{class} = y)$\n", "\n", "Below you will implement the Naive Bayes classifier (with lots of supporting routines already provided) and apply it to a dataset of mushrooms where the class label is 'p' for poisonous and 'e' for edible." ] }, { "cell_type": "code", "execution_count": 16, "id": "b367eb96", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "da040cf2", "metadata": {}, "source": [ "### Load the data" ] }, { "cell_type": "code", "execution_count": 17, "id": "c5e03fb2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>class</th><th>cap-shape</th><th>cap-surface</th><th>cap-color</th><th>bruises</th><th>odor</th><th>gill-attachment</th><th>gill-spacing</th><th>gill-size</th><th>gill-color</th><th>...</th><th>stalk-surface-below-ring</th><th>stalk-color-above-ring</th><th>stalk-color-below-ring</th><th>veil-type</th><th>veil-color</th><th>ring-number</th><th>ring-type</th><th>spore-print-color</th><th>population</th><th>habitat</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>p</td><td>x</td><td>s</td><td>n</td><td>t</td><td>p</td><td>f</td><td>c</td><td>n</td><td>k</td><td>...</td><td>s</td><td>w</td><td>w</td><td>p</td><td>w</td><td>o</td><td>p</td><td>k</td><td>s</td><td>u</td></tr>\n",
"    <tr><th>1</th><td>e</td><td>x</td><td>s</td><td>y</td><td>t</td><td>a</td><td>f</td><td>c</td><td>b</td><td>k</td><td>...</td><td>s</td><td>w</td><td>w</td><td>p</td><td>w</td><td>o</td><td>p</td><td>n</td><td>n</td><td>g</td></tr>\n",
"    <tr><th>2</th><td>e</td><td>b</td><td>s</td><td>w</td><td>t</td><td>l</td><td>f</td><td>c</td><td>b</td><td>n</td><td>...</td><td>s</td><td>w</td><td>w</td><td>p</td><td>w</td><td>o</td><td>p</td><td>n</td><td>n</td><td>m</td></tr>\n",
"    <tr><th>3</th><td>p</td><td>x</td><td>y</td><td>w</td><td>t</td><td>p</td><td>f</td><td>c</td><td>n</td><td>n</td><td>...</td><td>s</td><td>w</td><td>w</td><td>p</td><td>w</td><td>o</td><td>p</td><td>k</td><td>s</td><td>u</td></tr>\n",
"    <tr><th>4</th><td>e</td><td>x</td><td>s</td><td>g</td><td>f</td><td>n</td><td>f</td><td>w</td><td>b</td><td>k</td><td>...</td><td>s</td><td>w</td><td>w</td><td>p</td><td>w</td><td>o</td><td>e</td><td>n</td><td>a</td><td>g</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 23 columns</p>\n",
"</div>
" ], "text/plain": [ " class cap-shape cap-surface cap-color bruises odor gill-attachment \\\n", "0 p x s n t p f \n", "1 e x s y t a f \n", "2 e b s w t l f \n", "3 p x y w t p f \n", "4 e x s g f n f \n", "\n", " gill-spacing gill-size gill-color ... stalk-surface-below-ring \\\n", "0 c n k ... s \n", "1 c b k ... s \n", "2 c b n ... s \n", "3 c n n ... s \n", "4 w b k ... s \n", "\n", " stalk-color-above-ring stalk-color-below-ring veil-type veil-color \\\n", "0 w w p w \n", "1 w w p w \n", "2 w w p w \n", "3 w w p w \n", "4 w w p w \n", "\n", " ring-number ring-type spore-print-color population habitat \n", "0 o p k s u \n", "1 o p n n g \n", "2 o p n n m \n", "3 o p k s u \n", "4 o e n a g \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('mushrooms.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 18, "id": "ab6b17be", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['cap-shape',\n", " 'cap-surface',\n", " 'cap-color',\n", " 'bruises',\n", " 'odor',\n", " 'gill-attachment',\n", " 'gill-spacing',\n", " 'gill-size',\n", " 'gill-color',\n", " 'stalk-shape',\n", " 'stalk-root',\n", " 'stalk-surface-above-ring',\n", " 'stalk-surface-below-ring',\n", " 'stalk-color-above-ring',\n", " 'stalk-color-below-ring',\n", " 'veil-type',\n", " 'veil-color',\n", " 'ring-number',\n", " 'ring-type',\n", " 'spore-print-color',\n", " 'population',\n", " 'habitat']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Names of features\n", "FEAT_NAMES = df.columns.to_list()[1:]\n", "FEAT_NAMES" ] }, { "cell_type": "code", "execution_count": 19, "id": "c6606286", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['p', 'e']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Values for the class labels\n", "LABELS = ['p', 'e']\n", "LABELS" ] }, { "cell_type": 
"markdown", "id": "e5911a1d", "metadata": {}, "source": [ "### Counting\n", "\n", "The routine below does counting for you. Given a dataframe with mushroom instances, it can return 3 kinds of counts:\n", "* Counts of rows with a given class label\n", "* Counts of rows with a given feature value\n", "* Counts of rows with a given class label and a given feature value\n", "\n", "You can use this routine when computing the probabilities used by NB.\n", "\n", "Read through the code below, and then look at the cells after it to see examples of how to get each of the count types listed above.\n", "\n", "NOTE: I using memoization to store counts so they don't have to be computed from the raw data if that are requested a second time. That's irrelevant to what you need to do. It's just a simple way to avoid repeated work and will make your code significantly faster." ] }, { "cell_type": "code", "execution_count": 20, "id": "fef878f5", "metadata": {}, "outputs": [], "source": [ "MEMO = {}\n", "\n", "def count_rows(df, y = None, feat_name = None, feat_value = None):\n", " \n", " assert y is not None or feat_name is not None, \"You must specify at least a class or feature\"\n", " assert (feat_name is None and feat_value is None) or (feat_name is not None and feat_value is not None), \"Feature names require feature values, and vice versa\"\n", " \n", " key = None\n", " \n", " if y is not None and feat_name is None:\n", " key = 'class=%s' % y\n", " if not key in MEMO:\n", " MEMO[key] = len(df[df['class'] == y])\n", " \n", " if y is not None and feat_name is not None:\n", " key = 'class=%s;%s=%s' % (y, feat_name, feat_value)\n", " if not key in MEMO:\n", " MEMO[key] = len(df[(df['class'] == y) & (df[feat_name] == feat_value)])\n", "\n", " if y is None and feat_name is not None:\n", " key = '%s=%s' % (feat_name, feat_value)\n", " if not key in MEMO:\n", " MEMO[key] = len(df[df[feat_name] == feat_value])\n", "\n", " if key:\n", " return MEMO[key]\n", " \n", " assert True, 
\"Unexpected error\"\n", " " ] }, { "cell_type": "code", "execution_count": 21, "id": "ddf51b33", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3916" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count the number of rows with class label 'p'\n", "count_rows(df, y = 'p')" ] }, { "cell_type": "code", "execution_count": 22, "id": "8079ee57", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2556" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count the number of rows for which the cap-surface feature has the value 's'\n", "count_rows(df, feat_name = 'cap-surface', feat_value = 's')" ] }, { "cell_type": "code", "execution_count": 23, "id": "2a65748d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1412" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count the number of rows for which the class label is 'p' and \n", "# the cap-surface feature has the value 's'\n", "count_rows(df, y = 'p', feat_name = 'cap-surface', feat_value = 's')" ] }, { "cell_type": "code", "execution_count": 24, "id": "e1fc7d1d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.36\n" ] } ], "source": [ "# Compute p(cap-surface = 's' | class = 'p')\n", "print('%.2f' % (count_rows(df, y = 'p', feat_name = 'cap-surface', feat_value = 's')/count_rows(df, y = 'p')))" ] }, { "cell_type": "markdown", "id": "2df449fc", "metadata": {}, "source": [ "### Split the data into training and testing sets" ] }, { "cell_type": "code", "execution_count": 25, "id": "4237f992", "metadata": {}, "outputs": [], "source": [ "df_train = df[:-1000]\n", "df_test = df[-1000:]" ] }, { "cell_type": "markdown", "id": "81c77de8", "metadata": {}, "source": [ "### How to access features and the class label\n", "\n", "The code below shows how to get the feature names/values for a test instance as well as the true class 
label.\n", "\n", "The syntax `df_test.iloc[idx]` returns the row in the test dataframe at position `idx`. Valid positions run from 0 to `len(df_test) - 1`." ] },
{ "cell_type": "code", "execution_count": 26, "id": "1340c6a5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test instance 10 has class = e\n", "\n", "Test instance 10 has cap-shape = k\n", "Test instance 10 has cap-surface = s\n", "Test instance 10 has cap-color = g\n", "Test instance 10 has bruises = f\n", "Test instance 10 has odor = n\n", "Test instance 10 has gill-attachment = f\n", "Test instance 10 has gill-spacing = w\n", "Test instance 10 has gill-size = b\n", "Test instance 10 has gill-color = p\n", "Test instance 10 has stalk-shape = e\n", "Test instance 10 has stalk-root = ?\n", "Test instance 10 has stalk-surface-above-ring = k\n", "Test instance 10 has stalk-surface-below-ring = k\n", "Test instance 10 has stalk-color-above-ring = w\n", "Test instance 10 has stalk-color-below-ring = w\n", "Test instance 10 has veil-type = p\n", "Test instance 10 has veil-color = w\n", "Test instance 10 has ring-number = t\n", "Test instance 10 has ring-type = p\n", "Test instance 10 has spore-print-color = w\n", "Test instance 10 has population = s\n", "Test instance 10 has habitat = g\n" ] } ], "source": [ "idx = 10\n", "print('Test instance %d has class = %s\\n' % (idx, df_test.iloc[idx]['class']))\n", "for feat_name in FEAT_NAMES:\n", "    feat_value = df_test.iloc[idx][feat_name]\n", "    print('Test instance %d has %s = %s' % (idx, feat_name, feat_value))" ] },
{ "cell_type": "markdown", "id": "370e9441", "metadata": {}, "source": [ "# Task 1 - Implement Naive Bayes\n", "\n", "Fill in the function below. It takes as input the training set and an instance from the test set (e.g., df_test.iloc[10]) and returns the probabilities of the two classes. I store them in a dictionary, as you can see at the top of my partial implementation. But you can do whatever you want. Note that I initialize them to 1. 
Think about why I did that. Use count_rows() to get the quantities you need to classify the instance. To make a prediction you will simply choose the class with the largest probability.\n", "\n", "To help you debug, I've given the probabilities (normalized to sum to 1) that I got from my routine for a few test instances." ] },
{ "cell_type": "code", "execution_count": 12, "id": "21b852fd", "metadata": {}, "outputs": [], "source": [ "def NB_probs(df_train, instance):\n", "    probs = {'p':1, 'e':1}\n", "    \n", "    # YOUR CODE GOES HERE\n", "    \n", "    return probs" ] },
{ "cell_type": "code", "execution_count": 13, "id": "fbf0653b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'p': 0.6102078440663387, 'e': 0.38979215593366134}\n", "{'p': 1.0, 'e': 0.0}\n", "{'p': 0.37969627358365715, 'e': 0.6203037264163429}\n" ] } ], "source": [ "# Sample output\n", "print(NB_probs(df_train, df_test.iloc[10]))\n", "print(NB_probs(df_train, df_test.iloc[100]))\n", "print(NB_probs(df_train, df_test.iloc[502]))" ] },
{ "cell_type": "markdown", "id": "436237c7", "metadata": {}, "source": [ "# Task 2 - Use NB_probs() to classify \n", "\n", "Write code in the cell below to walk over the test data and classify each instance. **Print the classification accuracy at the end. Also print the index of each instance that is misclassified.** There is a little skeleton code to get you started." ] },
{ "cell_type": "code", "execution_count": 27, "id": "a13bb01e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\n" ] } ], "source": [ "num_correct = 0\n", "for idx in range(len(df_test)):\n", "    pass\n", "print(num_correct/len(df_test))" ] },
{ "cell_type": "markdown", "id": "8af82cbc", "metadata": {}, "source": [ "## Logistic Regression\n", "\n", "Recall that logistic regression learns a weight vector $w$ such that $w \\cdot x \\gg 0$ for positive instances and $w \\cdot x \\ll 0$ for negative instances. 
Below you'll look at the weights that were learned and think about which features are important." ] },
{ "cell_type": "code", "execution_count": 28, "id": "c640ca7b", "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.datasets import load_wine\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from sklearn.metrics import accuracy_score" ] },
{ "cell_type": "markdown", "id": "eed522fc", "metadata": {}, "source": [ "### Load the data\n", "\n", "The wine dataset has 13 features that are real valued and **all positive**. That last bit is important for what follows. The goal is to classify a sample of wine characterized by its 13 features into one of three types of wine." ] },
{ "cell_type": "code", "execution_count": 29, "id": "3b4938e1", "metadata": {}, "outputs": [], "source": [ "data = load_wine()\n", "X = data['data']\n", "y = data['target']" ] },
{ "cell_type": "markdown", "id": "86f1ddfd", "metadata": {}, "source": [ "### Train a classifier and look at the feature weights\n", "\n", "The plot below shows the weights associated with all 13 features for each of the three classes. They are overlaid so that you can compare weights across classes." ] },
{ "cell_type": "code", "execution_count": 30, "id": "64bdb2c8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/oates/tmp/env/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. 
of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " n_iter_i = _check_optimize_result(\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "clf = LogisticRegression(C = 1)\n", "clf.fit(X, y)\n", "x = list(range(13))\n", "plt.plot(x, clf.coef_[0], label=data['target_names'][0])\n", "plt.plot(x, clf.coef_[1], label=data['target_names'][1])\n", "plt.plot(x, clf.coef_[2], label=data['target_names'][2])\n", "plt.legend()\n", "plt.xticks(x, data['feature_names'], rotation ='vertical')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "6acfcd3f", "metadata": {}, "source": [ "# Task 3 - Answer the following questions\n", "\n", "Given the plot above, give a brief answer (a few sentences to a paragraph) to each of the following questions.\n", "\n", "* Which feature is most important for determining if a sample is class 1?\n", "* Which feature is most important for determining if a sample is class 2?\n", "* If malic_acid is a large number, does that make it more or less likely that the instance belongs to class 1? Why?\n", "* Which two classes are probably the hardest to tell apart? Why?\n", "* If you could only keep two features, which ones would you keep to maximize classification accuracy? Why those two?" ] }, { "cell_type": "code", "execution_count": null, "id": "9339cd55", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }